Introduction to Open Data Science - Course Project

About the project

The course is about learning to use R for statistics, with an emphasis on using it together with GitHub. My objective for attending the course is to refresh my knowledge of R and statistics in general. The course has been on the “TODO” list of my PhD studies for a while.

I think the R for Health Data Science book works well as an introduction, although, as I’m not a complete novice, I didn’t read the chapters very thoroughly; there’s a lot to take in in a short time (I missed the exercises in chapter 3, but not chapter 4). Perhaps the book works as a sampler now and as a reference later.

There have been quite a few new things since I last used R, like the tidyverse pipe. I’m glad it is possible to use the “language-agnostic standard” = for assignment instead of (or in addition to) <-, which I’ve always found odd. A lot of things are similar to features in pandas in Python data analysis (not surprisingly, as pandas has taken quite a bit of inspiration from R).

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sat Dec  2 21:11:07 2023"

GitHub repository: https://github.com/dustedmtl/IODS-project


Chapter 2: Regression and model validation

Data analysis

date()
## [1] "Sat Dec  2 21:11:07 2023"

The dataset contains averaged survey scores for three learning approaches (deep, surf and stra), student attitude, exam points, age, and gender. There are 166 students.

lr2 <- read.table('data/learning2014.csv', sep=',', header = T)
dim(lr2)
## [1] 166   7
summary(lr2)
##     gender               age           attitude         points     
##  Length:166         Min.   :17.00   Min.   :14.00   Min.   : 7.00  
##  Class :character   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00  
##  Mode  :character   Median :22.00   Median :32.00   Median :23.00  
##                     Mean   :25.51   Mean   :31.43   Mean   :22.72  
##                     3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75  
##                     Max.   :55.00   Max.   :50.00   Max.   :33.00  
##       deep            surf            stra      
##  Min.   :1.583   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.333   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.667   Median :2.833   Median :3.188  
##  Mean   :3.680   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.083   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.917   Max.   :4.333   Max.   :5.000
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
p <- ggpairs(lr2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The score variables deep, surf and stra take values between 1 and 5.

The ggpairs plot also shows the pairwise correlations in the data.

Linear regression

Linear regression is used to determine the relation between a response variable and one or more (in the case of multiple regression) explanatory variables. For a single variable, the equation to solve is:

\(Y = \alpha + \beta X + \epsilon\)

where all the terms are \(N\)-dimensional vectors (for \(N\) observations). \(Y\) is the response variable and \(X\) is the explanatory variable. The error term \(\epsilon\) is assumed to be normally distributed with a mean of 0.

The task is to find the values of \(\alpha\) and \(\beta\) that minimize the sum of squared errors \(\sum_{i=1}^{N} \epsilon_i^2\). This is called the least squares method.

For multiple linear regression the equation is similarly \(Y = \alpha + X\beta + \epsilon\), where \(X\) is a matrix of dimensions \(N \times k\) (with \(k\) explanatory variables) instead of an \(N \times 1\) vector.
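As a sketch of the least squares idea, the closed-form solution \(\hat\beta = (X^TX)^{-1}X^Ty\) can be computed by hand and compared with R’s lm(); the data here is simulated just for illustration:

```r
# Simulated data: y = 2 + 3x + normally distributed noise
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# Closed-form least squares: solve the normal equations (X'X) beta = X'y
X <- cbind(1, x)                       # design matrix with an intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)

# lm() produces the same coefficients
fit <- lm(y ~ x)
all.equal(as.vector(beta), unname(coef(fit)))  # TRUE
```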

Linear regression model for variable points based on three explanatory variables: attitude, stra and surf:

model <- lm(points ~ attitude + stra + surf, data = lr2)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lr2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

The overall model fit is assessed with the omnibus F-test, which tests the null hypothesis that the coefficients of all three explanatory variables are zero. The very low p-value of the F-statistic (3.156e-08) means that this hypothesis is unlikely to be true.

The low p-value for attitude (well below 0.05) shows that its coefficient is statistically significant; the coefficients for stra and surf are not.

The multiple R-squared value of 0.2074 means that the three variables account for about 21 % of the variation in exam points.

# Three diagnostic plots in the same figure
par(mfrow = c(2,2))

plot(model, which = c(1,2,5))

Above are three diagnostic plots of the residuals, where a residual is the difference between the fitted and the observed value.

The first plot (residuals vs fitted values) is symmetrical around zero and shows no relationship between the residuals and the fitted values, so a linear model is appropriate here.

The second plot (normal Q-Q) compares the quantiles of the standardized residuals with theoretical normal quantiles. The points fall approximately on a straight line, so the normality assumption for the errors is reasonable.

The last plot (residuals vs leverage) uses standardized residuals; it shows that no single observation has an outsized influence on the fitted model.


Chapter 3: Logistic regression

date()
## [1] "Sat Dec  2 21:11:09 2023"

In the previous section we covered linear regression. Here we consider some cases where its use may be inappropriate. In particular, linear regression relies on the error term of the response variable being normally distributed, and the response might not even be continuous. In the case of a binary response variable we would prefer to use logistic regression:

\(\log\left(\frac{p}{1-p}\right) = \alpha + \beta X\)

where \(p\) is the probability of one outcome and \(1-p\) the probability of the other, and \(\beta X\) is shorthand for the linear combination \(\sum_j \beta_j x_j\) of the explanatory variables.

Here \(\frac p {1-p}\) is the odds of one outcome against the other (the usefulness of odds rather than probabilities comes from the fact that the model is linear on the log-odds scale).
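The relationship between probability, odds and log-odds can be checked with base R’s qlogis() (the logit function) and plogis() (its inverse); the probability 0.8 below is just an example value:

```r
p <- 0.8
odds <- p / (1 - p)    # about 4: the outcome is four times as likely as its complement
log_odds <- log(odds)  # the left-hand side of the logistic regression equation

# qlogis() is the logit function, plogis() its inverse
all.equal(log_odds, qlogis(p))   # TRUE
all.equal(plogis(log_odds), p)   # TRUE
```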

Data analysis

The data set for analysis comes from studies about Portuguese students’ alcohol consumption:

http://www.archive.ics.uci.edu/dataset/320/student+performance

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc <- read.table('data/alc_data.csv', sep=',', header = T)
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

The data combines survey answers from mathematics and Portuguese language classes.

The data has a number of variables related to grading (G1, G2 and G3), absences, extra classes (paid) and previous failures (failures). It also has information about alcohol usage, including a boolean column for high alcohol usage (high_use). The rest are a variety of either numerical or categorical metadata columns.

The objective is to analyze the relation of alcohol consumption to the various background variables. The hypothesis is that high alcohol consumption is associated with

  • lower grades (especially G3)

  • more absences

Such students will also have spent less time studying (studytime) and will be less likely to have taken extra paid classes (paid).

Viewing the data

library(tidyr); library(dplyr); library(ggplot2)

# install.packages("patchwork")
library(patchwork)

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 <- g1 + geom_boxplot() + ylab("final grade")

g11 <- ggplot(alc, aes(x = sex, y = G3, col = high_use))
g11 <- g11 + geom_boxplot() + ylab("final grade")

g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 <- g2 + geom_boxplot() + ylab("absences")

g3 <- ggplot(alc, aes(x = studytime, y = high_use))
g3 <- g3 + geom_col() + ylab("high use")

g4 <- ggplot(alc, aes(x = paid, y = high_use))
g4 <- g4 + geom_boxplot() + ylab("high use")

g1 + g11 + g2 + g3

Grades are lower and there are more absences with high alcohol usage, although the effect is more pronounced for men than women. Students with high alcohol consumption spent a lot less time on studying.

gh <- ggplot(alc, aes(x = studytime))
gh <- gh + geom_histogram()

gh2 <- ggplot(alc, aes(x = studytime))
gh2 <- gh2 + geom_histogram()

gh + gh2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

table(high_use = alc$high_use, paid = alc$paid)
##         paid
## high_use  no yes
##    FALSE 140 119
##    TRUE   56  55

Students with low alcohol consumption took paid classes slightly less often, although the difference is not dramatic. My hypothesis was thus incorrect; it’s possible that students who use a lot of alcohol need the extra classes (perhaps to compensate for the lack of study time?).

Logistic regression model

m <- glm(high_use ~ G3 + absences + studytime + paid, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + studytime + paid, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.44301    0.51534   0.860 0.389984    
## G3          -0.06369    0.03725  -1.710 0.087317 .  
## absences     0.07688    0.02285   3.365 0.000766 ***
## studytime   -0.58041    0.16270  -3.567 0.000361 ***
## paidyes      0.40796    0.24619   1.657 0.097501 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 414.63  on 365  degrees of freedom
## AIC: 424.63
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)          G3    absences   studytime     paidyes 
##  0.44300723 -0.06368689  0.07687909 -0.58041234  0.40795610

The low p-values for absences and studytime indicate a statistically significant association with high alcohol usage. Somewhat surprisingly, grades (G3) do not reach significance at the 0.05 level, and neither do paid classes.

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %   97.5 %
## (Intercept) 1.5573836 0.5663598 4.305756
## G3          0.9382987 0.8717178 1.009274
## absences    1.0799115 1.0350021 1.132305
## studytime   0.5596675 0.4024071 0.762707
## paidyes     1.5037411 0.9304002 2.446804

For grades (G3) and paid classes, 1.0 is within the confidence interval, thus there is no evidence that these variables are associated with high alcohol usage.

Predictions

m2 <- glm(high_use ~ absences + studytime, data = alc, family = "binomial")

probabilities <- predict(m2, type = "response")

# add probability
alc2 <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)

table(high_use = alc2$high_use, prediction = alc2$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   248   11
##    TRUE     93   18
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67027027 0.02972973 0.70000000
##    TRUE  0.25135135 0.04864865 0.30000000
##    Sum   0.92162162 0.07837838 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$prediction)
## [1] 0.2810811

The proportion of incorrect predictions (the training error) is 28.1 %.

table(high_use = alc2$high_use, high_absences = alc2$absences > 1) %>% prop.table() %>% addmargins()
##         high_absences
## high_use      FALSE       TRUE        Sum
##    FALSE 0.23513514 0.46486486 0.70000000
##    TRUE  0.07027027 0.22972973 0.30000000
##    Sum   0.30540541 0.69459459 1.00000000
table(high_use = alc2$high_use, low_study = alc2$studytime < 2) %>% prop.table() %>% addmargins()
##         low_study
## high_use     FALSE      TRUE       Sum
##    FALSE 0.5486486 0.1513514 0.7000000
##    TRUE  0.1864865 0.1135135 0.3000000
##    Sum   0.7351351 0.2648649 1.0000000
table(high_use = alc2$high_use, low_study_high_absence = alc2$studytime < 2 | alc2$absences > 1) %>% prop.table() %>% addmargins()
##         low_study_high_absence
## high_use      FALSE       TRUE        Sum
##    FALSE 0.19189189 0.50810811 0.70000000
##    TRUE  0.05405405 0.24594595 0.30000000
##    Sum   0.24594595 0.75405405 1.00000000

Guessing based on simple rules such as low study time or high absences does not produce good predictions: for example, predicting high use whenever study time is low or absences exceed one would misclassify over half of the students (0.508 + 0.054 ≈ 56 %).

Cross-validation

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2837838

The 10-fold cross-validation error (28.4 %) is essentially the same as the training error, suggesting that the model generalizes about as well as it fits; note that cross-validation estimates out-of-sample performance rather than improving the model itself.


Chapter 4: Multivariate analysis, clustering and classification

date()
## [1] "Sat Dec  2 21:11:10 2023"

Multivariate analysis

In multivariate analysis several variables are analysed at the same time. The difference from analysing a single response variable is that we do not expect to “explain” one variable in terms of the others, but rather (often) explore the connections between the variables.

It is useful to explore the relations between variables graphically, for example with boxplots and scatterplots. The normality of the data can be assessed with quantile plots.

Clustering

The purpose of clustering is to group items into mutually exclusive groups. The members of a single group should be closer to each other than to the members of other groups.

A common metric for determining the difference between items is the Euclidean distance:

\(d_{ij} = \sqrt{\sum_{k=1}^{q}(x_{ik}-x_{jk})^2}\)

Another option is the Manhattan distance, which assumes “block-wise” traversal, i.e. you can only move along a rectangular street grid (east/west or north/south), summing the absolute differences along each coordinate.
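A quick check of the two metrics with R’s dist(), using the points (0, 0) and (3, 4):

```r
pts <- rbind(c(0, 0), c(3, 4))

# Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts)                        # 5

# Manhattan: |3 - 0| + |4 - 0| = 7
dist(pts, method = "manhattan")  # 7
```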

The distance metrics are obviously sensitive to differences in scale; for them to work the data may have to be centered around the mean and standardized by the standard deviation:

\(\mathrm{scaled}(x) = \frac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}\)
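R’s scale() function performs exactly this transformation; a small check against the manual formula:

```r
x <- c(2, 4, 6, 8)

manual <- (x - mean(x)) / sd(x)  # center, then divide by the standard deviation
scaled <- as.vector(scale(x))    # scale() centers and standardizes by default

all.equal(manual, scaled)  # TRUE
# The scaled values have mean 0 and standard deviation 1 (up to floating point)
```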

Hierarchical clustering

In hierarchical clustering each item is iteratively connected to their closest sub-clusters (either single items or clusters previously defined based on them). There are three common ways to measure the distance between clusters:

  • Single linkage

  • Maximum linkage

  • Average linkage

In single linkage, the distance between clusters is the smallest distance between any members of the respective clusters. Maximum and average linkage are based on the maximum and average distance, respectively. Hierarchical clustering produces a tree-like dendrogram.
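The linkage criteria correspond to the method argument of R’s hclust() (maximum linkage is called "complete" there); a minimal sketch on made-up data:

```r
set.seed(42)
toy <- matrix(rnorm(20), ncol = 2)  # 10 observations, 2 variables
d <- dist(toy)                      # Euclidean distance matrix

hc_single   <- hclust(d, method = "single")    # closest pair of members
hc_complete <- hclust(d, method = "complete")  # farthest pair (maximum linkage)
hc_average  <- hclust(d, method = "average")   # average over all pairs

# plot(hc_complete) would draw the dendrogram;
# cutree() cuts the tree into a chosen number of clusters
clusters <- cutree(hc_complete, k = 3)
table(clusters)
```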

K-means clustering

The k-means algorithm seeks to group items into \(k\) clusters. Each item is assigned to the nearest cluster center, after which the centers are recomputed; this is iterated until the assignments no longer change. Computing all pairwise distances would be computationally infeasible for large datasets, which k-means avoids by only measuring distances to the \(k\) centers. Because the result depends on the initial centers, the algorithm is typically restarted several times, keeping the solution with the best fit (the smallest within-cluster sum of squares).
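One common way to choose \(k\) is to compute the total within-cluster sum of squares (WCSS) for a range of \(k\) values and look for an “elbow” where the decrease levels off; a sketch on simulated data with three well-separated groups:

```r
set.seed(123)
# Simulated data: three clusters centered at 0, 5 and 10
toy <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
             matrix(rnorm(40, mean = 5),  ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))

# Total WCSS for k = 1..6; nstart restarts guard against bad initial centers
twcss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# twcss drops sharply up to the true number of clusters and levels off after it;
# plot(1:6, twcss, type = "b") would show the elbow
```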

Data analysis

We are analysing Boston housing market data and its relations to various variables. The details about the data set are found here:

https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
## corrplot 0.92 loaded
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

In addition to the median house price, the variables include the crime rate, taxes, the proportion of the black population, and access to education, employment and transportation, among other things.

Renaming the variables to be more descriptive:

names(Boston) = c("crime", "zone.pct", "ind.pct",
                  "river", "nox.ppm",
                  "rooms.avg", "age.pct",
                  "dist", "roads.idx", "tax",
                  "pupil.ratio", "black",
                  "stat.pct", "price")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crime      : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zone.pct   : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ ind.pct    : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ river      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox.ppm    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rooms.avg  : num  6.58 6.42 7.18 7 7.15 ...
##  $ age.pct    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dist       : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ roads.idx  : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax        : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ pupil.ratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black      : num  397 397 393 395 397 ...
##  $ stat.pct   : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ price      : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
##     crime zone.pct ind.pct river nox.ppm rooms.avg age.pct   dist roads.idx tax
## 1 0.00632       18    2.31     0   0.538     6.575    65.2 4.0900         1 296
## 2 0.02731        0    7.07     0   0.469     6.421    78.9 4.9671         2 242
## 3 0.02729        0    7.07     0   0.469     7.185    61.1 4.9671         2 242
## 4 0.03237        0    2.18     0   0.458     6.998    45.8 6.0622         3 222
## 5 0.06905        0    2.18     0   0.458     7.147    54.2 6.0622         3 222
## 6 0.02985        0    2.18     0   0.458     6.430    58.7 6.0622         3 222
##   pupil.ratio  black stat.pct price
## 1        15.3 396.90     4.98  24.0
## 2        17.8 396.90     9.14  21.6
## 3        17.8 392.83     4.03  34.7
## 4        18.7 394.63     2.94  33.4
## 5        18.7 396.90     5.33  36.2
## 6        18.7 394.12     5.21  28.7

The ranges of the variables vary considerably; some are percentages, some are ratios, and the rest take a variety of positive values (mostly above 1).

# round to two decimals (plain round() would round the correlations to whole numbers)
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix,
         method="circle", type = "upper",
         cl.pos = "b",
         tl.pos = "d", tl.cex = 0.6)

The correlation matrix shows positive correlations between the crime rate, taxation and radial road access. The house price is positively correlated with the average number of rooms (there is no variable for house size itself) and negatively correlated with the pupil-teacher ratio (i.e. fewer teachers per pupil) and with the proportion of lower-status population.

Proximity to the Charles River doesn’t seem to be correlated with anything.

To be able to properly analyze the data, we need to scale it.

boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- as.numeric(boston_scaled$crime)
summary(boston_scaled$crime)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110

We create a categorical variable for the crime rates.

bins <- quantile(boston_scaled$crime)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crime,
             breaks = bins,
             include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high")
             )
boston_scaled <- dplyr::select(boston_scaled, -crime)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
head(boston_scaled)
##     zone.pct    ind.pct      river    nox.ppm rooms.avg    age.pct     dist
## 1  0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948 0.140075
## 2 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824  0.3668034 0.556609
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490 0.556609
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878 1.076671
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743 1.076671
## 6 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916 -0.3508100 1.076671
##    roads.idx        tax pupil.ratio     black   stat.pct      price crime
## 1 -0.9818712 -0.6659492  -1.4575580 0.4406159 -1.0744990  0.1595278   low
## 2 -0.8670245 -0.9863534  -0.3027945 0.4406159 -0.4919525 -0.1014239   low
## 3 -0.8670245 -0.9863534  -0.3027945 0.3960351 -1.2075324  1.3229375   low
## 4 -0.7521778 -1.1050216   0.1129203 0.4157514 -1.3601708  1.1815886   low
## 5 -0.7521778 -1.1050216   0.1129203 0.4406159 -1.0254866  1.4860323   low
## 6 -0.7521778 -1.1050216   0.1129203 0.4101651 -1.0422909  0.6705582   low
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## Using crime as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The newly scaled variables have a mean of zero.

Train and test sets

n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

Linear discriminant analysis (LDA)

LDA is used to reduce the dimensionality of the data. Here we want to see which variables are most relevant to the crime rate.

lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2623762 0.2400990 0.2574257 
## 
## Group means:
##             zone.pct    ind.pct       river    nox.ppm   rooms.avg    age.pct
## low       0.91775959 -0.9238915 -0.06938576 -0.8904449  0.48247396 -0.9283360
## med_low  -0.07626688 -0.2732250 -0.01233188 -0.5560333 -0.13335915 -0.3459839
## med_high -0.37054365  0.1935858  0.17414622  0.4229112  0.08587135  0.4154769
## high     -0.48724019  1.0170690 -0.04518867  1.0562128 -0.46857744  0.7910589
##                dist  roads.idx        tax pupil.ratio      black    stat.pct
## low       0.8480576 -0.6811386 -0.7699980 -0.46088995  0.3814883 -0.79318701
## med_low   0.3755945 -0.5452370 -0.4631497 -0.05702597  0.3186495 -0.12252381
## med_high -0.3926630 -0.4265815 -0.3140428 -0.26517540  0.1230960  0.06258688
## high     -0.8516574  1.6386213  1.5144083  0.78135074 -0.8337407  0.87681080
##                price
## low       0.58660081
## med_low  -0.01259366
## med_high  0.17107333
## high     -0.71302948
## 
## Coefficients of linear discriminants:
##                     LD1         LD2         LD3
## zone.pct     0.09094912  0.61394151 -0.96260219
## ind.pct      0.06159580 -0.22926314  0.32792122
## river       -0.04264060  0.01114121  0.14223408
## nox.ppm      0.41785090 -0.82427460 -1.30159773
## rooms.avg    0.05728181 -0.09630495 -0.21981881
## age.pct      0.20276326 -0.35374143  0.03272508
## dist        -0.08467857 -0.22581145  0.34363845
## roads.idx    3.39719304  0.92744987 -0.19568163
## tax         -0.03266278  0.05292419  0.64585704
## pupil.ratio  0.11938716 -0.05668705 -0.29630210
## black       -0.10379200  0.01316014  0.09942475
## stat.pct     0.16053935 -0.21375288  0.25929034
## price        0.04766577 -0.40803990 -0.24767531
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9539 0.0341 0.0121
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

The first discriminant accounts for 95 % of the between-group variance (the proportion of trace). Its largest coefficient is for roads.idx, the index of accessibility to radial highways, followed by nitrogen oxide concentration, the percentage of old houses and the proportion of lower-status population.

# target classes as numeric
classes <- as.numeric(train$crime)

plot(lda.fit,
     dimen = 2,
     col = classes,
     pch = classes)
lda.arrows(lda.fit, myscale = 2)

The plot confirms that access to radial highways has the most relevance to crime rates.

lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16      11        3    0
##   med_low    2      17        1    0
##   med_high   0      12       15    2
##   high       0       0        0   23
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
##           predicted
## correct            low     med_low    med_high        high         Sum
##   low      0.156862745 0.107843137 0.029411765 0.000000000 0.294117647
##   med_low  0.019607843 0.166666667 0.009803922 0.000000000 0.196078431
##   med_high 0.000000000 0.117647059 0.147058824 0.019607843 0.284313725
##   high     0.000000000 0.000000000 0.000000000 0.225490196 0.225490196
##   Sum      0.176470588 0.392156863 0.186274510 0.245098039 1.000000000

The method correctly classifies 69.6 % of the items:

sum(correct_classes == lda.pred$class) / length(correct_classes)
## [1] 0.6960784

K-means

Distances

# center and standardize variables
boston_scaled2 <- scale(Boston)

# scale() returns a matrix, so the column names are set with colnames()
colnames(boston_scaled2) = c("crime", "zone.pct", "ind.pct",
                  "river", "nox.ppm",
                  "rooms.avg", "age.pct",
                  "dist", "roads.idx", "tax",
                  "pupil.ratio", "black",
                  "stat.pct", "price")

# change the object to data frame
boston_scaled2 = as.data.frame(boston_scaled2)

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Above are summaries of the Euclidean and Manhattan distances for the scaled dataset; the Manhattan distances are larger, as expected.

# note: k-means depends on the random initial centers; set.seed() would make this reproducible
km <- kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

Bonus

km2 <- kmeans(Boston, centers = 3)

# plot the Boston dataset with clusters
pairs(Boston, col = km2$cluster)

Super-Bonus exercise

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1,
        y = matrix_product$LD2,
        z = matrix_product$LD3,
        type= 'scatter3d', mode='markers',
        color=train$crime,
        )

The 3D scatterplot shows some clustering according to the crime categories.
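The same projection trick can be sketched on a dataset that ships with R (the iris example below is mine, not from the exercise): multiplying the predictor matrix by the LDA scaling matrix gives one coordinate per linear discriminant.

```r
library(MASS)

# 3 classes -> at most 2 linear discriminants (LD1, LD2)
fit <- lda(Species ~ ., data = iris)

# project the predictors onto the discriminants via matrix multiplication
proj <- as.matrix(iris[, 1:4]) %*% fit$scaling

dim(proj)  # 150 observations in 2 discriminant dimensions
```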


Chapter 5: Dimensionality reduction techniques

date()
## [1] "Sat Dec  2 21:11:13 2023"

Principal Component Analysis (PCA)

PCA is a method for transforming a set of correlated variables into a smaller set of uncorrelated and ordered variables: \(x_1, \ldots, x_q \to y_1, \ldots, y_q\).

The first component \(y_1\) is the linear combination with the highest sample variance. Written in terms of coefficients, \(y_1 = \sum_{i=1}^q a_{1i} x_i\) with the restriction that the sum of squares of the coefficients is 1, that is \(\sum_{i=1}^q a_{1i}^2 = 1\). The remaining components are determined in the same way, with the additional restriction that each is uncorrelated with the earlier ones; e.g., for the second component \(\sum_{i=1}^q a_{1i} a_{2i} = 0\).

The coefficients can be extracted from the covariance or correlation matrices.
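This can be checked directly in base R (a sketch with simulated data): the loadings returned by prcomp() match the eigenvectors of the covariance matrix up to sign, and the eigenvalues match the component variances.

```r
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)  # 50 observations, 4 variables

pca <- prcomp(X)       # centers the data by default
eig <- eigen(cov(X))   # eigendecomposition of the covariance matrix

# eigenvalues = squared standard deviations of the components
vals_match <- isTRUE(all.equal(eig$values, as.numeric(pca$sdev^2)))

# eigenvectors = rotation matrix, up to an arbitrary sign flip per column
vecs_match <- isTRUE(all.equal(abs(eig$vectors), abs(unname(pca$rotation)),
                               tolerance = 1e-6))
```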

Why do PCA?

First of all, reducing the dimensions allows for easier analysis. PCA is especially useful for exploration, as we shall see.

Multidimensional Scaling and Correspondence Analysis

For categorical variables, we can use Correspondence Analysis (Multiple Correspondence Analysis when there are more than two variables). In practical terms, the variables are cross-tabulated into a correspondence matrix, which is then decomposed with singular value decomposition (SVD).
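A minimal sketch of that machinery using only base R (a built-in cross-tabulation, not the course data): the squared singular values of the standardized residual matrix are the principal inertias of the CA dimensions, and they sum to the chi-square statistic divided by n.

```r
# small built-in contingency table: cylinders vs. gears in mtcars
tab <- unclass(table(mtcars$cyl, mtcars$gear))
n   <- sum(tab)

P  <- tab / n       # correspondence matrix
rw <- rowSums(P)    # row masses
cl <- colSums(P)    # column masses

# standardized residuals: (observed - expected) / sqrt(expected)
S <- diag(1 / sqrt(rw)) %*% (P - rw %o% cl) %*% diag(1 / sqrt(cl))

# principal inertias; their sum equals chi-square / n (the total inertia)
inertias <- svd(S)$d^2
```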

Data analysis

The data set for analysis is originally from the UN Development Programme:

https://hdr.undp.org/system/files/documents/technical-notes-calculating-human-development-indices.pdf

The data set contains variables related to Human Development Index (HDI) and Gender Inequality Index (GII).

The HDI calculation takes into account:

  • life expectancy

  • expected years of education

  • per capita GNI

The GII calculation takes into account:

  • maternal mortality

  • female empowerment

  • labour participation rates

The data has been massaged to contain only a small set of variables.

1. Data Exploration

library(corrplot)
library(tibble)
library(readr)

human <- read_csv("data/human_data_2.csv", show_col_types = FALSE)
human_ <- column_to_rownames(human, "Country")
summary(human_)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
str(human_)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : num  64992 42261 56431 44025 45435 ...
##  $ Mat.Mor  : num  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The variables that remain are:

  • Female/male ratio of population shares with at least secondary education (Edu2.FM). Values above 1 mean a larger share of females than males have secondary education.

  • Male/female ratio for labour participation (Labo.FM)

  • Life Expectancy (Life.Exp)

  • Length of Education (Edu.Exp)

  • GNI Index (GNI)

  • Maternal Mortality Ratio (Mat.Mor): maternal deaths per 100,000 live births.

  • Adolescent Birth Rate (Ado.Birth)

  • Share of female parliamentary participation (percentage) (Parli.F)

# Access GGally
library(GGally)

# visualize the 'human_' variables
ggpairs(human_, progress = FALSE)

# Access corrplot
library(corrplot)
# compute the correlation matrix and visualize it with corrplot
cor(human_) %>% corrplot()

The two plots visualize the same correlation information in different ways. In the latter plot, the strength of the correlation is shown with the size of the circle and colors indicate sign of correlation (deep red = negative, deep blue = positive).

The labor market participation ratio by gender doesn’t seem to correlate with anything. Maternal mortality is strongly negatively correlated with life expectancy, length of education and the female education ratio, and positively correlated with adolescent birth rate; the same pattern holds for adolescent birth rate itself. These strong correlations are also visible in the scatterplots of the first plot. There are other obvious correlations, such as between life expectancy and length of education.

Some variables have a roughly normal-looking distribution (I’m not going to run a test to verify this); others, like GNI and maternal mortality, are skewed with long tails and have a large number of countries at the low end.

2. Principal Component Analysis

library(tibble)
pca_human <- prcomp(human_)
# create and print out a summary of pca_human
s <- summary(pca_human)

# rounded percentanges of variance captured by each PC
pca_pr <- round(1*s$importance[2, ], digits = 5) * 100

# print out the percentages of variance
pc_lab <- print(pca_pr)
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
#paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human,
       cex = c(0.8, 1),
       col = c("grey40", "deeppink2"),
       xlab = pc_lab[1],
       ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped


In the unscaled case the first component explains 99.99% of the variance and the biplot isn’t useful: GNI alone dominates, as its values (and hence its variance) are orders of magnitude larger than those of the other variables.
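The effect is easy to reproduce with toy data (my own sketch, not the human data): give one variable a variance several orders of magnitude above the rest, and the first unscaled component collapses onto it.

```r
set.seed(42)
toy <- data.frame(
  small = rnorm(100, sd = 1),
  huge  = rnorm(100, sd = 1000)  # stands in for GNI
)

p_raw    <- prcomp(toy)                  # unscaled
p_scaled <- prcomp(toy, scale. = TRUE)   # standardized

# share of total variance on PC1: near 1 unscaled, near 0.5 scaled
raw_share    <- p_raw$sdev[1]^2 / sum(p_raw$sdev^2)
scaled_share <- p_scaled$sdev[1]^2 / sum(p_scaled$sdev^2)
```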

3. Standardized Principal Component Analysis

human_std <- scale(human_)
pca_human_scaled <- prcomp(human_std)

# create and print out a summary of pca_human
s_scaled <- summary(pca_human_scaled)

# rounded percentanges of variance captured by each PC
pca_pr_scaled <- round(1*s_scaled$importance[2, ], digits = 5) * 100

# print out the percentages of variance
pc_lab_scaled <- print(pca_pr_scaled)
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 53.605 16.237  9.571  7.583  5.477  3.595  2.634  1.298
# paste0(names(pca_pr_scaled), " (", pca_pr_scaled, "%)")
# draw a biplot
biplot(pca_human_scaled,
       cex = c(0.8, 1),
       col = c("grey40", "deeppink2"),
       xlab = pc_lab_scaled[1],
       ylab = pc_lab_scaled[2])

4. PCA interpretation

Scaling the data is quite important here, as the variables live on very different scales (ratios around 1, life expectancies in the tens, GNI in the thousands). The pink arrows show the direction and strength of each variable’s contribution to the first two components.

In the scaled case, the first two components account for around 70% of the variation. The second component seems to index female parliamentary participation percentage and labour market participation ratio and the first component the rest (either positively or negatively).

5. Tea time with Correspondence Analysis

library(dplyr)
library(tidyr)
library(ggplot2)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))  # all_of() avoids a tidyselect deprecation warning
# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...

The tea time data set contains 300 results from a questionnaire about tea consumption.

head(tea_time)
##         Tea   How     how    sugar       where     lunch
## 1     black alone tea bag    sugar chain store Not.lunch
## 2     black  milk tea bag No.sugar chain store Not.lunch
## 3 Earl Grey alone tea bag No.sugar chain store Not.lunch
## 4 Earl Grey alone tea bag    sugar chain store Not.lunch
## 5 Earl Grey alone tea bag No.sugar chain store Not.lunch
## 6 Earl Grey alone tea bag No.sugar chain store Not.lunch
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) +
  facet_wrap("name", scales = "free")

Tea is clearly a leisure-time drink: most respondents don’t drink it at lunch, and most buy tea bags from a chain store rather than visiting a tea shop. It is mostly drunk plain, without lemon or milk.

library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |

The first dimension seems to mostly index the variables how (tea bag vs. unpackaged) and where (chain store vs. tea shop), which have by far the highest eta² values on Dim.1. This is in line with the bar plots above.

plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

From the MCA plot, it would appear that one goes to a tea shop to drink unpackaged (loose) tea.


(more chapters to be added similarly as we proceed with the course!)